source("pairs.r")
library(dplyr)
Registered S3 method overwritten by 'dplyr':
method from
print.rowwise_df
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
library(ggplot2)
library(pastecs)
Attaching package: ‘pastecs’
The following objects are masked from ‘package:dplyr’:
first, last
library(psych)
Attaching package: ‘psych’
The following objects are masked from ‘package:ggplot2’:
%+%, alpha
library(Amelia)
Loading required package: Rcpp
##
## Amelia II: Multiple Imputation
## (Version 1.7.5, built: 2018-05-07)
## Copyright (C) 2005-2019 James Honaker, Gary King and Matthew Blackwell
## Refer to http://gking.harvard.edu/amelia/ for more information
##
library(mlbench)
library(corrplot)
corrplot 0.84 loaded
library(caret)
Loading required package: lattice
Registered S3 method overwritten by 'data.table':
method from
print.data.table
library(readr)
library(gridExtra)
Attaching package: ‘gridExtra’
The following object is masked from ‘package:dplyr’:
combine
library(grid)
library(ggplot2)
library(lattice)
library(leaps)
data <- read_csv("data/WA_Fn-UseC_-HR-Employee-Attrition.csv")
Parsed with column specification:
cols(
.default = col_double(),
Attrition = [31mcol_character()[39m,
BusinessTravel = [31mcol_character()[39m,
Department = [31mcol_character()[39m,
EducationField = [31mcol_character()[39m,
Gender = [31mcol_character()[39m,
JobRole = [31mcol_character()[39m,
MaritalStatus = [31mcol_character()[39m,
Over18 = [31mcol_character()[39m,
OverTime = [31mcol_character()[39m
)
See spec(...) for full column specifications.
head(data)
#####################################
##
## Reformat the data so that it is
## 1) Easy to use (add nice column names)
## 2) Interpreted correctly by glm()..
##
#####################################
names(data)
[1] "Age" "Attrition" "BusinessTravel" "DailyRate"
[5] "Department" "DistanceFromHome" "Education" "EducationField"
[9] "EmployeeCount" "EmployeeNumber" "EnvironmentSatisfaction" "Gender"
[13] "HourlyRate" "JobInvolvement" "JobLevel" "JobRole"
[17] "JobSatisfaction" "MaritalStatus" "MonthlyIncome" "MonthlyRate"
[21] "NumCompaniesWorked" "Over18" "OverTime" "PercentSalaryHike"
[25] "PerformanceRating" "RelationshipSatisfaction" "StandardHours" "StockOptionLevel"
[29] "TotalWorkingYears" "TrainingTimesLastYear" "WorkLifeBalance" "YearsAtCompany"
[33] "YearsInCurrentRole" "YearsSinceLastPromotion" "YearsWithCurrManager"
summary(data)
Age Attrition BusinessTravel DailyRate Department DistanceFromHome Education
Min. :18.00 Length:1470 Length:1470 Min. : 102.0 Length:1470 Min. : 1.000 Min. :1.000
1st Qu.:30.00 Class :character Class :character 1st Qu.: 465.0 Class :character 1st Qu.: 2.000 1st Qu.:2.000
Median :36.00 Mode :character Mode :character Median : 802.0 Mode :character Median : 7.000 Median :3.000
Mean :36.92 Mean : 802.5 Mean : 9.193 Mean :2.913
3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000 3rd Qu.:4.000
Max. :60.00 Max. :1499.0 Max. :29.000 Max. :5.000
EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate
Length:1470 Min. :1 Min. : 1.0 Min. :1.000 Length:1470 Min. : 30.00
Class :character 1st Qu.:1 1st Qu.: 491.2 1st Qu.:2.000 Class :character 1st Qu.: 48.00
Mode :character Median :1 Median :1020.5 Median :3.000 Mode :character Median : 66.00
Mean :1 Mean :1024.9 Mean :2.722 Mean : 65.89
3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.:4.000 3rd Qu.: 83.75
Max. :1 Max. :2068.0 Max. :4.000 Max. :100.00
JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome MonthlyRate
Min. :1.00 Min. :1.000 Length:1470 Min. :1.000 Length:1470 Min. : 1009 Min. : 2094
1st Qu.:2.00 1st Qu.:1.000 Class :character 1st Qu.:2.000 Class :character 1st Qu.: 2911 1st Qu.: 8047
Median :3.00 Median :2.000 Mode :character Median :3.000 Mode :character Median : 4919 Median :14236
Mean :2.73 Mean :2.064 Mean :2.729 Mean : 6503 Mean :14313
3rd Qu.:3.00 3rd Qu.:3.000 3rd Qu.:4.000 3rd Qu.: 8379 3rd Qu.:20462
Max. :4.00 Max. :5.000 Max. :4.000 Max. :19999 Max. :26999
NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction
Min. :0.000 Length:1470 Length:1470 Min. :11.00 Min. :3.000 Min. :1.000
1st Qu.:1.000 Class :character Class :character 1st Qu.:12.00 1st Qu.:3.000 1st Qu.:2.000
Median :2.000 Mode :character Mode :character Median :14.00 Median :3.000 Median :3.000
Mean :2.693 Mean :15.21 Mean :3.154 Mean :2.712
3rd Qu.:4.000 3rd Qu.:18.00 3rd Qu.:3.000 3rd Qu.:4.000
Max. :9.000 Max. :25.00 Max. :4.000 Max. :4.000
StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
Min. :80 Min. :0.0000 Min. : 0.00 Min. :0.000 Min. :1.000 Min. : 0.000 Min. : 0.000
1st Qu.:80 1st Qu.:0.0000 1st Qu.: 6.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.: 3.000 1st Qu.: 2.000
Median :80 Median :1.0000 Median :10.00 Median :3.000 Median :3.000 Median : 5.000 Median : 3.000
Mean :80 Mean :0.7939 Mean :11.28 Mean :2.799 Mean :2.761 Mean : 7.008 Mean : 4.229
3rd Qu.:80 3rd Qu.:1.0000 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
Max. :80 Max. :3.0000 Max. :40.00 Max. :6.000 Max. :4.000 Max. :40.000 Max. :18.000
YearsSinceLastPromotion YearsWithCurrManager
Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 2.000
Median : 1.000 Median : 3.000
Mean : 2.188 Mean : 4.123
3rd Qu.: 3.000 3rd Qu.: 7.000
Max. :15.000 Max. :17.000
data$Attrition <- ifelse(data$Attrition == "Yes", 1, 0)
data$Attrition <- factor(data$Attrition, levels = c(0, 1))
data$Over18 <- ifelse(data$Over18 == "Y", 1, 0)
data$Over18 <- factor(data$Over18, levels = c(0, 1))
data$OverTime <- ifelse(data$OverTime == "Yes", 1, 0)
data$OverTime <- factor(data$OverTime, levels = c(0, 1))
data$BusinessTravel<-factor(data$BusinessTravel)
data$Department<-factor(data$Department)
data$EducationField<-factor(data$EducationField)
data$Gender<-factor(data$Gender)
data$MaritalStatus<-factor(data$MaritalStatus)
data$JobRole<-factor(data$JobRole)
data$Education<-factor(data$Education, order = TRUE, levels=c(1,2,3,4,5))
data$EnvironmentSatisfaction<-factor(data$EnvironmentSatisfaction, order=TRUE, levels=c(1,2,3,4))
data$JobInvolvement<-factor(data$JobInvolvement, order=TRUE, levels=c(1,2,3,4))
data$JobSatisfaction<-factor(data$JobSatisfaction, order=TRUE, levels=c(1,2,3,4))
data$PerformanceRating<-factor(data$PerformanceRating, order=TRUE, levels=c(1,2,3,4))
data$RelationshipSatisfaction<-factor(data$RelationshipSatisfaction, order=TRUE, levels=c(1,2,3,4))
data$WorkLifeBalance<-factor(data$WorkLifeBalance, order=TRUE, levels=c(1,2,3,4))
data$StockOptionLevel<-factor(data$StockOptionLevel, order=TRUE, levels=c(0,1,2,3))
str(data)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1470 obs. of 35 variables:
$ Age : num 41 49 37 33 27 32 59 30 38 36 ...
$ Attrition : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 1 1 ...
$ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
$ DailyRate : num 1102 279 1373 1392 591 ...
$ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
$ DistanceFromHome : num 1 8 2 3 2 2 3 24 23 27 ...
$ Education : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 2 1 2 4 1 2 3 1 3 3 ...
$ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
$ EmployeeCount : num 1 1 1 1 1 1 1 1 1 1 ...
$ EmployeeNumber : num 1 2 4 5 7 8 10 11 12 13 ...
$ EnvironmentSatisfaction : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 3 4 4 1 4 3 4 4 3 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
$ HourlyRate : num 94 61 92 56 40 79 81 67 44 94 ...
$ JobInvolvement : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 3 2 2 3 3 3 4 3 2 3 ...
$ JobLevel : num 2 2 1 1 1 1 1 1 3 2 ...
$ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
$ JobSatisfaction : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 4 2 3 3 2 4 1 3 3 3 ...
$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
$ MonthlyIncome : num 5993 5130 2090 2909 3468 ...
$ MonthlyRate : num 19479 24907 2396 23159 16632 ...
$ NumCompaniesWorked : num 8 1 6 1 9 0 4 1 0 6 ...
$ Over18 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ OverTime : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 1 1 1 ...
$ PercentSalaryHike : num 11 23 15 11 12 13 20 22 21 13 ...
$ PerformanceRating : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 3 4 3 3 3 3 4 4 4 3 ...
$ RelationshipSatisfaction: Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 4 2 3 4 3 1 2 2 2 ...
$ StandardHours : num 80 80 80 80 80 80 80 80 80 80 ...
$ StockOptionLevel : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 2 1 1 2 1 4 2 1 3 ...
$ TotalWorkingYears : num 8 10 7 8 6 8 12 1 10 17 ...
$ TrainingTimesLastYear : num 0 3 3 3 3 2 3 2 2 3 ...
$ WorkLifeBalance : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 3 3 3 3 2 2 3 3 2 ...
$ YearsAtCompany : num 6 10 0 8 2 7 1 1 9 7 ...
$ YearsInCurrentRole : num 4 7 0 7 2 7 0 0 7 7 ...
$ YearsSinceLastPromotion : num 0 1 0 3 2 3 0 0 1 7 ...
$ YearsWithCurrManager : num 5 7 0 0 2 6 0 0 8 7 ...
- attr(*, "spec")=
.. cols(
.. Age = [32mcol_double()[39m,
.. Attrition = [31mcol_character()[39m,
.. BusinessTravel = [31mcol_character()[39m,
.. DailyRate = [32mcol_double()[39m,
.. Department = [31mcol_character()[39m,
.. DistanceFromHome = [32mcol_double()[39m,
.. Education = [32mcol_double()[39m,
.. EducationField = [31mcol_character()[39m,
.. EmployeeCount = [32mcol_double()[39m,
.. EmployeeNumber = [32mcol_double()[39m,
.. EnvironmentSatisfaction = [32mcol_double()[39m,
.. Gender = [31mcol_character()[39m,
.. HourlyRate = [32mcol_double()[39m,
.. JobInvolvement = [32mcol_double()[39m,
.. JobLevel = [32mcol_double()[39m,
.. JobRole = [31mcol_character()[39m,
.. JobSatisfaction = [32mcol_double()[39m,
.. MaritalStatus = [31mcol_character()[39m,
.. MonthlyIncome = [32mcol_double()[39m,
.. MonthlyRate = [32mcol_double()[39m,
.. NumCompaniesWorked = [32mcol_double()[39m,
.. Over18 = [31mcol_character()[39m,
.. OverTime = [31mcol_character()[39m,
.. PercentSalaryHike = [32mcol_double()[39m,
.. PerformanceRating = [32mcol_double()[39m,
.. RelationshipSatisfaction = [32mcol_double()[39m,
.. StandardHours = [32mcol_double()[39m,
.. StockOptionLevel = [32mcol_double()[39m,
.. TotalWorkingYears = [32mcol_double()[39m,
.. TrainingTimesLastYear = [32mcol_double()[39m,
.. WorkLifeBalance = [32mcol_double()[39m,
.. YearsAtCompany = [32mcol_double()[39m,
.. YearsInCurrentRole = [32mcol_double()[39m,
.. YearsSinceLastPromotion = [32mcol_double()[39m,
.. YearsWithCurrManager = [32mcol_double()[39m
.. )
names(data)
[1] "Age" "Attrition" "BusinessTravel" "DailyRate"
[5] "Department" "DistanceFromHome" "Education" "EducationField"
[9] "EmployeeCount" "EmployeeNumber" "EnvironmentSatisfaction" "Gender"
[13] "HourlyRate" "JobInvolvement" "JobLevel" "JobRole"
[17] "JobSatisfaction" "MaritalStatus" "MonthlyIncome" "MonthlyRate"
[21] "NumCompaniesWorked" "Over18" "OverTime" "PercentSalaryHike"
[25] "PerformanceRating" "RelationshipSatisfaction" "StandardHours" "StockOptionLevel"
[29] "TotalWorkingYears" "TrainingTimesLastYear" "WorkLifeBalance" "YearsAtCompany"
[33] "YearsInCurrentRole" "YearsSinceLastPromotion" "YearsWithCurrManager"
stat.desc(data)
describe(data)
str(data)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame': 1470 obs. of 35 variables:
$ Age : num 41 49 37 33 27 32 59 30 38 36 ...
$ Attrition : Factor w/ 2 levels "0","1": 2 1 2 1 1 1 1 1 1 1 ...
$ BusinessTravel : Factor w/ 3 levels "Non-Travel","Travel_Frequently",..: 3 2 3 2 3 2 3 3 2 3 ...
$ DailyRate : num 1102 279 1373 1392 591 ...
$ Department : Factor w/ 3 levels "Human Resources",..: 3 2 2 2 2 2 2 2 2 2 ...
$ DistanceFromHome : num 1 8 2 3 2 2 3 24 23 27 ...
$ Education : Ord.factor w/ 5 levels "1"<"2"<"3"<"4"<..: 2 1 2 4 1 2 3 1 3 3 ...
$ EducationField : Factor w/ 6 levels "Human Resources",..: 2 2 5 2 4 2 4 2 2 4 ...
$ EmployeeCount : num 1 1 1 1 1 1 1 1 1 1 ...
$ EmployeeNumber : num 1 2 4 5 7 8 10 11 12 13 ...
$ EnvironmentSatisfaction : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 2 3 4 4 1 4 3 4 4 3 ...
$ Gender : Factor w/ 2 levels "Female","Male": 1 2 2 1 2 2 1 2 2 2 ...
$ HourlyRate : num 94 61 92 56 40 79 81 67 44 94 ...
$ JobInvolvement : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 3 2 2 3 3 3 4 3 2 3 ...
$ JobLevel : num 2 2 1 1 1 1 1 1 3 2 ...
$ JobRole : Factor w/ 9 levels "Healthcare Representative",..: 8 7 3 7 3 3 3 3 5 1 ...
$ JobSatisfaction : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 4 2 3 3 2 4 1 3 3 3 ...
$ MaritalStatus : Factor w/ 3 levels "Divorced","Married",..: 3 2 3 2 2 3 2 1 3 2 ...
$ MonthlyIncome : num 5993 5130 2090 2909 3468 ...
$ MonthlyRate : num 19479 24907 2396 23159 16632 ...
$ NumCompaniesWorked : num 8 1 6 1 9 0 4 1 0 6 ...
$ Over18 : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
$ OverTime : Factor w/ 2 levels "0","1": 2 1 2 2 1 1 2 1 1 1 ...
$ PercentSalaryHike : num 11 23 15 11 12 13 20 22 21 13 ...
$ PerformanceRating : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 3 4 3 3 3 3 4 4 4 3 ...
$ RelationshipSatisfaction: Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 4 2 3 4 3 1 2 2 2 ...
$ StandardHours : num 80 80 80 80 80 80 80 80 80 80 ...
$ StockOptionLevel : Ord.factor w/ 4 levels "0"<"1"<"2"<"3": 1 2 1 1 2 1 4 2 1 3 ...
$ TotalWorkingYears : num 8 10 7 8 6 8 12 1 10 17 ...
$ TrainingTimesLastYear : num 0 3 3 3 3 2 3 2 2 3 ...
$ WorkLifeBalance : Ord.factor w/ 4 levels "1"<"2"<"3"<"4": 1 3 3 3 3 2 2 3 3 2 ...
$ YearsAtCompany : num 6 10 0 8 2 7 1 1 9 7 ...
$ YearsInCurrentRole : num 4 7 0 7 2 7 0 0 7 7 ...
$ YearsSinceLastPromotion : num 0 1 0 3 2 3 0 0 1 7 ...
$ YearsWithCurrManager : num 5 7 0 0 2 6 0 0 8 7 ...
- attr(*, "spec")=
.. cols(
.. Age = [32mcol_double()[39m,
.. Attrition = [31mcol_character()[39m,
.. BusinessTravel = [31mcol_character()[39m,
.. DailyRate = [32mcol_double()[39m,
.. Department = [31mcol_character()[39m,
.. DistanceFromHome = [32mcol_double()[39m,
.. Education = [32mcol_double()[39m,
.. EducationField = [31mcol_character()[39m,
.. EmployeeCount = [32mcol_double()[39m,
.. EmployeeNumber = [32mcol_double()[39m,
.. EnvironmentSatisfaction = [32mcol_double()[39m,
.. Gender = [31mcol_character()[39m,
.. HourlyRate = [32mcol_double()[39m,
.. JobInvolvement = [32mcol_double()[39m,
.. JobLevel = [32mcol_double()[39m,
.. JobRole = [31mcol_character()[39m,
.. JobSatisfaction = [32mcol_double()[39m,
.. MaritalStatus = [31mcol_character()[39m,
.. MonthlyIncome = [32mcol_double()[39m,
.. MonthlyRate = [32mcol_double()[39m,
.. NumCompaniesWorked = [32mcol_double()[39m,
.. Over18 = [31mcol_character()[39m,
.. OverTime = [31mcol_character()[39m,
.. PercentSalaryHike = [32mcol_double()[39m,
.. PerformanceRating = [32mcol_double()[39m,
.. RelationshipSatisfaction = [32mcol_double()[39m,
.. StandardHours = [32mcol_double()[39m,
.. StockOptionLevel = [32mcol_double()[39m,
.. TotalWorkingYears = [32mcol_double()[39m,
.. TrainingTimesLastYear = [32mcol_double()[39m,
.. WorkLifeBalance = [32mcol_double()[39m,
.. YearsAtCompany = [32mcol_double()[39m,
.. YearsInCurrentRole = [32mcol_double()[39m,
.. YearsSinceLastPromotion = [32mcol_double()[39m,
.. YearsWithCurrManager = [32mcol_double()[39m
.. )
summary(data)
Age Attrition BusinessTravel DailyRate Department DistanceFromHome
Min. :18.00 0:1233 Non-Travel : 150 Min. : 102.0 Human Resources : 63 Min. : 1.000
1st Qu.:30.00 1: 237 Travel_Frequently: 277 1st Qu.: 465.0 Research & Development:961 1st Qu.: 2.000
Median :36.00 Travel_Rarely :1043 Median : 802.0 Sales :446 Median : 7.000
Mean :36.92 Mean : 802.5 Mean : 9.193
3rd Qu.:43.00 3rd Qu.:1157.0 3rd Qu.:14.000
Max. :60.00 Max. :1499.0 Max. :29.000
Education EducationField EmployeeCount EmployeeNumber EnvironmentSatisfaction Gender HourlyRate
1:170 Human Resources : 27 Min. :1 Min. : 1.0 1:284 Female:588 Min. : 30.00
2:282 Life Sciences :606 1st Qu.:1 1st Qu.: 491.2 2:287 Male :882 1st Qu.: 48.00
3:572 Marketing :159 Median :1 Median :1020.5 3:453 Median : 66.00
4:398 Medical :464 Mean :1 Mean :1024.9 4:446 Mean : 65.89
5: 48 Other : 82 3rd Qu.:1 3rd Qu.:1555.8 3rd Qu.: 83.75
Technical Degree:132 Max. :1 Max. :2068.0 Max. :100.00
JobInvolvement JobLevel JobRole JobSatisfaction MaritalStatus MonthlyIncome
1: 83 Min. :1.000 Sales Executive :326 1:289 Divorced:327 Min. : 1009
2:375 1st Qu.:1.000 Research Scientist :292 2:280 Married :673 1st Qu.: 2911
3:868 Median :2.000 Laboratory Technician :259 3:442 Single :470 Median : 4919
4:144 Mean :2.064 Manufacturing Director :145 4:459 Mean : 6503
3rd Qu.:3.000 Healthcare Representative:131 3rd Qu.: 8379
Max. :5.000 Manager :102 Max. :19999
(Other) :215
MonthlyRate NumCompaniesWorked Over18 OverTime PercentSalaryHike PerformanceRating RelationshipSatisfaction
Min. : 2094 Min. :0.000 0: 0 0:1054 Min. :11.00 1: 0 1:276
1st Qu.: 8047 1st Qu.:1.000 1:1470 1: 416 1st Qu.:12.00 2: 0 2:303
Median :14236 Median :2.000 Median :14.00 3:1244 3:459
Mean :14313 Mean :2.693 Mean :15.21 4: 226 4:432
3rd Qu.:20462 3rd Qu.:4.000 3rd Qu.:18.00
Max. :26999 Max. :9.000 Max. :25.00
StandardHours StockOptionLevel TotalWorkingYears TrainingTimesLastYear WorkLifeBalance YearsAtCompany YearsInCurrentRole
Min. :80 0:631 Min. : 0.00 Min. :0.000 1: 80 Min. : 0.000 Min. : 0.000
1st Qu.:80 1:596 1st Qu.: 6.00 1st Qu.:2.000 2:344 1st Qu.: 3.000 1st Qu.: 2.000
Median :80 2:158 Median :10.00 Median :3.000 3:893 Median : 5.000 Median : 3.000
Mean :80 3: 85 Mean :11.28 Mean :2.799 4:153 Mean : 7.008 Mean : 4.229
3rd Qu.:80 3rd Qu.:15.00 3rd Qu.:3.000 3rd Qu.: 9.000 3rd Qu.: 7.000
Max. :80 Max. :40.00 Max. :6.000 Max. :40.000 Max. :18.000
YearsSinceLastPromotion YearsWithCurrManager
Min. : 0.000 Min. : 0.000
1st Qu.: 0.000 1st Qu.: 2.000
Median : 1.000 Median : 3.000
Mean : 2.188 Mean : 4.123
3rd Qu.: 3.000 3rd Qu.: 7.000
Max. :15.000 Max. :17.000
# attach(data)
for(i in 1:length(data)) {
plot(data$Attrition, eval(parse(text=paste("data",names(data)[i],sep="$"))), xlab = "Attrition", ylab = names(data)[i])
}
help(missmap)
options(repr.plot.width = 24, repr.plot.height = 24)
missmap(data, col=c("blue", "red"), legend=TRUE)
the condition has length > 1 and only the first element will be usedUnknown or uninitialised column: 'arguments'.Unknown or uninitialised column: 'arguments'.Unknown or uninitialised column: 'imputations'.
options(repr.plot.width = 16, repr.plot.height = 16)
num_data <- data[, sapply(data, is.numeric)]
stat.desc(num_data)
correlations <- cor(num_data)
the standard deviation is zero
corrplot(correlations, method="circle")
# View(data)
plot(data[c(2,1,3,4,5)])
plot(data[c(2,6,7,8,9)])
plot(data[c(2,10,11,12,13)])
plot(data[c(2,14,15,16,17)])
plot(data[c(2,18,19,20,21)])
plot(data[c(2,22,23,24,25)])
plot(data[c(2,26,27,28,29)])
plot(data[c(2,30,31,32,33)])
plot(data[c(2,34,35)])
cat_data <- data[, sapply(data, is.factor)]
summary(cat_data)
Attrition BusinessTravel Department Education EducationField EnvironmentSatisfaction
0:1233 Non-Travel : 150 Human Resources : 63 1:170 Human Resources : 27 1:284
1: 237 Travel_Frequently: 277 Research & Development:961 2:282 Life Sciences :606 2:287
Travel_Rarely :1043 Sales :446 3:572 Marketing :159 3:453
4:398 Medical :464 4:446
5: 48 Other : 82
Technical Degree:132
Gender JobInvolvement JobRole JobSatisfaction MaritalStatus Over18 OverTime
Female:588 1: 83 Sales Executive :326 1:289 Divorced:327 0: 0 0:1054
Male :882 2:375 Research Scientist :292 2:280 Married :673 1:1470 1: 416
3:868 Laboratory Technician :259 3:442 Single :470
4:144 Manufacturing Director :145 4:459
Healthcare Representative:131
Manager :102
(Other) :215
PerformanceRating RelationshipSatisfaction StockOptionLevel WorkLifeBalance
1: 0 1:276 0:631 1: 80
2: 0 2:303 1:596 2:344
3:1244 3:459 2:158 3:893
4: 226 4:432 3: 85 4:153
ggplot(data=data, aes(Attrition)) + geom_histogram(stat="count") + labs(x="Attrition")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(Age)) + geom_histogram(binwidth=5) + labs(x="Age")
ggplot(data=data, aes(BusinessTravel)) + geom_histogram(stat="count") + labs(x="Business Travel")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(DailyRate)) + geom_histogram(binwidth=15) + labs(x="Daily Rate")
ggplot(data=data, aes(Department)) + geom_histogram(stat="count") + labs(x="Department")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(DistanceFromHome)) + geom_histogram(binwidth=5) + labs(x="Distance from Home")
ggplot(data=data, aes(Education)) + geom_histogram(stat="count") + labs(x="Education")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(EducationField)) + geom_histogram(stat="count") + labs(x="Education Field")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(EmployeeCount)) + geom_histogram(binwidth=1) + labs(x="Employee Count")
ggplot(data=data, aes(EmployeeNumber)) + geom_histogram(binwidth=20) + labs(x="Employee Number")
ggplot(data=data, aes(EnvironmentSatisfaction)) + geom_histogram(stat="count") + labs(x="Environment Satisfaction")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(Gender)) + geom_histogram(stat="count") + labs(x="Gender")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(HourlyRate)) + geom_histogram(binwidth=5) + labs(x="Hourly Rate")
ggplot(data=data, aes(JobInvolvement)) + geom_histogram(stat="count") + labs(x="Job Involvement")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(JobLevel)) + geom_histogram(stat="count") + labs(x="Job Level")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(JobRole)) + geom_histogram(stat="count") + labs(x="Job Role")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(JobSatisfaction)) + geom_histogram(stat="count") + labs(x="Job Satisfaction")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(MaritalStatus)) + geom_histogram(stat="count") + labs(x="Marital Status")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(MonthlyIncome)) + geom_histogram(binwidth=50) + labs(x="Monthly Income")
ggplot(data=data, aes(MonthlyRate)) + geom_histogram(binwidth=50) + labs(x="Monthly Rate")
ggplot(data=data, aes(NumCompaniesWorked)) + geom_histogram(binwidth=1) + labs(x="Num Companies Worked")
ggplot(data=data, aes(Over18)) + geom_histogram(stat="count") + labs(x="Over 18")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(PercentSalaryHike)) + geom_histogram(binwidth=5) + labs(x="Percent Salary Hike")
ggplot(data=data, aes(PerformanceRating)) + geom_histogram(stat="count") + labs(x="Performance Rating")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(RelationshipSatisfaction)) + geom_histogram(stat="count") + labs(x="Relationship Satisfaction")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(StandardHours)) + geom_histogram(binwidth=5) + labs(x="Standard Hours")
ggplot(data=data, aes(StockOptionLevel)) + geom_histogram(stat="count") + labs(x="Stock Option Level")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(TotalWorkingYears)) + geom_histogram(binwidth=5) + labs(x="Total Working Years")
ggplot(data=data, aes(TrainingTimesLastYear)) + geom_histogram(binwidth=5) + labs(x="Training Times Last Year")
ggplot(data=data, aes(WorkLifeBalance)) + geom_histogram(stat="count") + labs(x="Work Life Balance")
Ignoring unknown parameters: binwidth, bins, pad
ggplot(data=data, aes(YearsAtCompany)) + geom_histogram(binwidth=2) + labs(x="Years At Company")
ggplot(data=data, aes(YearsSinceLastPromotion)) + geom_histogram(binwidth=2) + labs(x="Years Since Last Promotion")
ggplot(data=data, aes(YearsWithCurrManager)) + geom_histogram(binwidth=2) + labs(x="Years With Curr Manager")
#######################################
## BUILD OUR LOGISTIC MODEL - logmod
#######################################
# xtabs(~Attrition + Age, data=data)
xtabs(~Attrition + BusinessTravel, data=data)
BusinessTravel
Attrition Non-Travel Travel_Frequently Travel_Rarely
0 138 208 887
1 12 69 156
# xtabs(~Attrition + DailyRate, data=data)
xtabs(~Attrition + factor(Department), data=data)
factor(Department)
Attrition Human Resources Research & Development Sales
0 51 828 354
1 12 133 92
# xtabs(~Attrition + DistanceFromHome, data=data)
xtabs(~Attrition + factor(Education), data=data)
factor(Education)
Attrition 1 2 3 4 5
0 139 238 473 340 43
1 31 44 99 58 5
xtabs(~Attrition + EducationField, data=data)
EducationField
Attrition Human Resources Life Sciences Marketing Medical Other Technical Degree
0 20 517 124 401 71 100
1 7 89 35 63 11 32
xtabs(~Attrition + EmployeeCount, data=data)
EmployeeCount
Attrition 1
0 1233
1 237
# xtabs(~Attrition + EmployeeNumber, data=data)
xtabs(~Attrition + factor(EnvironmentSatisfaction), data=data)
factor(EnvironmentSatisfaction)
Attrition 1 2 3 4
0 212 244 391 386
1 72 43 62 60
xtabs(~Attrition + factor(Gender), data=data)
factor(Gender)
Attrition Female Male
0 501 732
1 87 150
# xtabs(~Attrition + HourlyRate, data=data)
xtabs(~Attrition + factor(JobInvolvement), data=data)
factor(JobInvolvement)
Attrition 1 2 3 4
0 55 304 743 131
1 28 71 125 13
xtabs(~Attrition + factor(JobLevel), data=data)
factor(JobLevel)
Attrition 1 2 3 4 5
0 400 482 186 101 64
1 143 52 32 5 5
xtabs(~Attrition + factor(JobRole), data=data)
factor(JobRole)
Attrition Healthcare Representative Human Resources Laboratory Technician Manager Manufacturing Director Research Director
0 122 40 197 97 135 78
1 9 12 62 5 10 2
factor(JobRole)
Attrition Research Scientist Sales Executive Sales Representative
0 245 269 50
1 47 57 33
xtabs(~Attrition + factor(JobSatisfaction), data=data)
factor(JobSatisfaction)
Attrition 1 2 3 4
0 223 234 369 407
1 66 46 73 52
xtabs(~Attrition + factor(MaritalStatus), data=data)
factor(MaritalStatus)
Attrition Divorced Married Single
0 294 589 350
1 33 84 120
# xtabs(~Attrition + MonthlyIncome, data=data)
# xtabs(~Attrition + MonthlyRate, data=data)
xtabs(~Attrition + NumCompaniesWorked, data=data)
NumCompaniesWorked
Attrition 0 1 2 3 4 5 6 7 8 9
0 174 423 130 143 122 47 54 57 43 40
1 23 98 16 16 17 16 16 17 6 12
xtabs(~Attrition + factor(OverTime), data=data)
factor(OverTime)
Attrition 0 1
0 944 289
1 110 127
# xtabs(~Attrition + PercentSalaryHike, data=data)
xtabs(~Attrition + factor(PerformanceRating), data=data)
factor(PerformanceRating)
Attrition 3 4
0 1044 189
1 200 37
xtabs(~Attrition + factor(RelationshipSatisfaction), data=data)
factor(RelationshipSatisfaction)
Attrition 1 2 3 4
0 219 258 388 368
1 57 45 71 64
xtabs(~Attrition + StandardHours, data=data)
StandardHours
Attrition 80
0 1233
1 237
xtabs(~Attrition + factor(StockOptionLevel), data=data)
factor(StockOptionLevel)
Attrition 0 1 2 3
0 477 540 146 70
1 154 56 12 15
# xtabs(~Attrition + TotalWorkingYears, data=data)
xtabs(~Attrition + TrainingTimesLastYear, data=data)
TrainingTimesLastYear
Attrition 0 1 2 3 4 5 6
0 39 62 449 422 97 105 59
1 15 9 98 69 26 14 6
xtabs(~Attrition + factor(WorkLifeBalance), data=data)
factor(WorkLifeBalance)
Attrition 1 2 3 4
0 55 286 766 126
1 25 58 127 27
# xtabs(~Attrition + YearsAtCompany, data=data)
# xtabs(~Attrition + YearsSinceLastPromotion, data=data)
# xtabs(~Attrition + YearsWithCurrManager, data=data)
#####################################
##
## Now we will use all of the data available to predict attrition
##
#####################################
# logmod <- glm(Attrition ~ ., data=data, family="binomial")
logmod <- glm(
factor(Attrition) ~
Age+
factor(BusinessTravel)+
DailyRate+
factor(Department)+
DistanceFromHome+
factor(Education)+
EducationField+
#EmployeeCount+
#EmployeeNumber+
factor(EnvironmentSatisfaction)+
factor(Gender)+
HourlyRate+
factor(JobInvolvement)+
factor(JobLevel)+
factor(JobRole)+
factor(JobSatisfaction)+
factor(MaritalStatus)+
MonthlyIncome+
MonthlyRate+
NumCompaniesWorked+
factor(OverTime)+
PercentSalaryHike+
factor(PerformanceRating)+
factor(RelationshipSatisfaction)+
StandardHours+
factor(StockOptionLevel)+
TotalWorkingYears+
TrainingTimesLastYear+
factor(WorkLifeBalance)+
YearsAtCompany+
YearsSinceLastPromotion+
YearsWithCurrManager
, data=data
, family=binomial
)
summary(logmod)
Call:
glm(formula = factor(Attrition) ~ Age + factor(BusinessTravel) +
DailyRate + factor(Department) + DistanceFromHome + factor(Education) +
EducationField + factor(EnvironmentSatisfaction) + factor(Gender) +
HourlyRate + factor(JobInvolvement) + factor(JobLevel) +
factor(JobRole) + factor(JobSatisfaction) + factor(MaritalStatus) +
MonthlyIncome + MonthlyRate + NumCompaniesWorked + factor(OverTime) +
PercentSalaryHike + factor(PerformanceRating) + factor(RelationshipSatisfaction) +
StandardHours + factor(StockOptionLevel) + TotalWorkingYears +
TrainingTimesLastYear + factor(WorkLifeBalance) + YearsAtCompany +
YearsSinceLastPromotion + YearsWithCurrManager, family = binomial,
data = data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.8292 -0.4431 -0.2028 -0.0612 3.7401
Coefficients: (1 not defined because of singularities)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -1.531e+01 5.904e+02 -0.026 0.979305
Age -3.008e-02 1.440e-02 -2.088 0.036755 *
factor(BusinessTravel)Travel_Frequently 2.079e+00 4.371e-01 4.755 1.98e-06 ***
factor(BusinessTravel)Travel_Rarely 1.036e+00 4.024e-01 2.574 0.010064 *
DailyRate -4.962e-04 2.327e-04 -2.132 0.033013 *
factor(Department)Research & Development 1.439e+01 5.904e+02 0.024 0.980558
factor(Department)Sales 1.363e+01 5.904e+02 0.023 0.981577
DistanceFromHome 5.490e-02 1.151e-02 4.769 1.85e-06 ***
factor(Education).L 1.219e-01 4.094e-01 0.298 0.765893
factor(Education).Q -1.388e-01 3.551e-01 -0.391 0.695898
factor(Education).C 7.611e-02 2.686e-01 0.283 0.776921
factor(Education)^4 -4.938e-02 1.946e-01 -0.254 0.799698
EducationFieldLife Sciences -1.130e+00 8.880e-01 -1.273 0.203124
EducationFieldMarketing -6.325e-01 9.318e-01 -0.679 0.497300
EducationFieldMedical -1.110e+00 8.859e-01 -1.253 0.210282
EducationFieldOther -1.066e+00 9.582e-01 -1.112 0.265935
EducationFieldTechnical Degree -1.800e-02 8.996e-01 -0.020 0.984036
factor(EnvironmentSatisfaction).L -9.969e-01 1.944e-01 -5.129 2.91e-07 ***
factor(EnvironmentSatisfaction).Q 4.574e-01 1.928e-01 2.372 0.017689 *
factor(EnvironmentSatisfaction).C -2.503e-01 1.961e-01 -1.277 0.201737
factor(Gender)Male 4.406e-01 1.956e-01 2.253 0.024271 *
HourlyRate 4.111e-03 4.743e-03 0.867 0.386129
factor(JobInvolvement).L -1.514e+00 3.328e-01 -4.550 5.37e-06 ***
factor(JobInvolvement).Q 3.234e-01 2.686e-01 1.204 0.228628
factor(JobInvolvement).C -2.821e-01 1.826e-01 -1.545 0.122365
factor(JobLevel)2 -1.583e+00 4.822e-01 -3.283 0.001026 **
factor(JobLevel)3 1.544e-01 7.384e-01 0.209 0.834373
factor(JobLevel)4 -1.660e-01 1.239e+00 -0.134 0.893405
factor(JobLevel)5 2.887e+00 1.634e+00 1.767 0.077211 .
factor(JobRole)Human Resources 1.468e+01 5.904e+02 0.025 0.980159
factor(JobRole)Laboratory Technician 5.590e-01 6.083e-01 0.919 0.358105
factor(JobRole)Manager -9.678e-02 1.097e+00 -0.088 0.929723
factor(JobRole)Manufacturing Director 4.485e-01 5.588e-01 0.803 0.422206
factor(JobRole)Research Director -1.786e+00 1.160e+00 -1.540 0.123483
factor(JobRole)Research Scientist -5.607e-01 6.309e-01 -0.889 0.374160
factor(JobRole)Sales Executive 2.100e+00 1.243e+00 1.689 0.091277 .
factor(JobRole)Sales Representative 1.793e+00 1.327e+00 1.351 0.176657
factor(JobSatisfaction).L -8.416e-01 1.936e-01 -4.348 1.37e-05 ***
factor(JobSatisfaction).Q -2.366e-02 1.917e-01 -0.123 0.901737
factor(JobSatisfaction).C -2.858e-01 1.915e-01 -1.492 0.135614
factor(MaritalStatus)Married 2.739e-01 2.890e-01 0.948 0.343378
factor(MaritalStatus)Single 6.421e-01 4.137e-01 1.552 0.120620
MonthlyIncome -1.514e-04 9.516e-05 -1.592 0.111485
MonthlyRate 1.001e-05 1.318e-05 0.759 0.447780
NumCompaniesWorked 2.102e-01 4.125e-02 5.097 3.45e-07 ***
factor(OverTime)1 2.202e+00 2.114e-01 10.418 < 2e-16 ***
PercentSalaryHike -1.698e-02 4.107e-02 -0.414 0.679206
factor(PerformanceRating).L 2.648e-02 2.977e-01 0.089 0.929125
factor(RelationshipSatisfaction).L -6.671e-01 1.902e-01 -3.507 0.000453 ***
factor(RelationshipSatisfaction).Q 4.915e-01 1.982e-01 2.480 0.013154 *
factor(RelationshipSatisfaction).C -1.848e-01 1.998e-01 -0.925 0.355116
StandardHours NA NA NA NA
factor(StockOptionLevel).L -1.990e-01 3.361e-01 -0.592 0.553700
factor(StockOptionLevel).Q 9.243e-01 3.054e-01 3.027 0.002472 **
factor(StockOptionLevel).C -1.199e-01 2.847e-01 -0.421 0.673570
TotalWorkingYears -5.974e-02 3.083e-02 -1.938 0.052655 .
TrainingTimesLastYear -1.842e-01 7.608e-02 -2.421 0.015475 *
factor(WorkLifeBalance).L -8.467e-01 3.018e-01 -2.805 0.005026 **
factor(WorkLifeBalance).Q 7.187e-01 2.477e-01 2.902 0.003707 **
factor(WorkLifeBalance).C 1.071e-01 1.820e-01 0.588 0.556386
YearsAtCompany 4.953e-02 3.848e-02 1.287 0.198034
YearsSinceLastPromotion 1.519e-01 4.458e-02 3.408 0.000653 ***
YearsWithCurrManager -1.750e-01 4.972e-02 -3.520 0.000432 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1298.58 on 1469 degrees of freedom
Residual deviance: 782.92 on 1408 degrees of freedom
AIC: 906.92
Number of Fisher Scoring iterations: 15
original.data<-data
#balancing data:
#what to do if number of 1's is much smaller than the number of 0's
#find the indeces corresponding to 0s, and 1's
input_ones <- data[which(data$Attrition == 1), ] # all 1's
input_zeros <- data[which(data$Attrition == 0), ] # all 0's
#reduce the number of 0's by selecting a sample at random of the size of the 1's you have.
#you can have a different proportion. I made it 50:50 but use your judgement.
#Perhaps you want 1:2 ratio to include more cases.
#set.seed(100) # for repeatability of samples
#which.zeros<- sample(1:nrow(input_zeros), nrow(input_ones))
#sample.from.zeros<-input_zeros[which.zeros,]
#put the data together:
#balanced.data<-rbind(input_ones,sample.from.zeros)
#Altenatively, you can leave the zeros as they are and resample the 1's
#set.seed(100) # for repeatability of samples
#which.ones<- sample(1:2*nrow(input_ones), nrow(input_zeros),replace=TRUE)
#resample.ones<-input_ones[which.ones,]
#balanced.data<-rbind(resample.ones,input_zeros)
#another way: sake a sample of the same zise for both (p=.5), total samples=3000
# "both" oversamples minority(1's) and undersamples mayority(0's)
library(ROSE)
Loaded ROSE 0.0-3
balanced.data<-ovun.sample(Attrition~.,data=data,method="both",p=0.5,N=3000,seed=1)$data
#data<-original.data
data<-balanced.data
logmod <- glm(
factor(Attrition) ~
Age
+ factor(BusinessTravel) * DistanceFromHome * factor(MaritalStatus) * factor(WorkLifeBalance)
+ DailyRate
+ factor(Department)
+ factor(Education) * EducationField
+ factor(JobInvolvement) * factor(JobLevel) * factor(JobSatisfaction)
+ factor(EnvironmentSatisfaction) * factor(Gender) * factor(JobRole)
+ MonthlyIncome * HourlyRate * factor(OverTime)
+ NumCompaniesWorked
+ factor(PerformanceRating)
+ factor(RelationshipSatisfaction)
+ factor(StockOptionLevel) * TotalWorkingYears * TrainingTimesLastYear
+ PercentSalaryHike * YearsAtCompany * YearsSinceLastPromotion * YearsWithCurrManager
, data=data
, family=binomial
)
glm.fit: fitted probabilities numerically 0 or 1 occurred
#summary(logmod)
AIC(logmod)
[1] 1527.899
BIC(logmod)
[1] 3275.752
Odds Ratios You can also exponentiate the coefficients and interpret them as odds-ratios.
ln a P(Attrition=1)/P(Attrition=0)
R will do this computation for you. To get the exponentiated coefficients, you tell R that you want to exponentiate (exp), and that the object you want to exponentiate is called coefficients and it is part of mylogit (coef(logmod)). We can use the same logic to get odds ratios and their confidence intervals, by exponentiating the confidence intervals from before. To put it all in one table, we use cbind to bind the coefficients and confidence intervals column-wise.
#cbind(exp.coef=exp(logmod$coef),exp(confint(logmod)))
If we use R’s diagnostic plot, the first one is the scatterplot of the residuals, against predicted values (the score actually).
#plot(logmod,which=1)
#plot(predict(logmod),residuals(logmod))
#abline(h=0,lty=2,col="grey")
Why do we have those two lines of points ? Because we predict a probability for a variable taking values 0 or 1. If the tree value is 0, then we always predict more, and residuals have to be negative (the blue points) and if the true value is 1, then we underestimate, and residuals have to be positive (the red points). And of course, there is a monotone relationship. We can see more clearly what’s going on when we use colors.
#plot(predict(logmod),residuals(logmod),col=c("blue","red")[1+data$Attrition])
#abline(h=0,lty=2,col="grey")
Points are exactly on a smooth curve, as a function of the predicted value,
To understand what is going on, let’s fit a curve through those points: lowess regression:
#plot(predict(logmod),residuals(logmod),col=c("blue","red")[1+data$Attrition])
#abline(h=0,lty=2,col="grey")
#lines(lowess(predict(logmod),residuals(logmod)),col="black",lwd=2)
We want the regressed line to be very close to the dotted line. Here we draw a CI around the curve: use loess()
Fit a polynomial surface determined by one or more numerical predictors, using local fitting. Fitting is done locally. That is, for the fit at point x, the fit is made using points in a neighborhood of x, weighted by their distance from x. The size of the neighborhood is controlled by α the default is .75. (loess is a newer formula of lowess)
#plot(predict(logmod),residuals(logmod),col=c("blue","red")[1+data$Attrition])
#abline(h=0,lty=2,col="grey")
#lines(lowess(predict(logmod),residuals(logmod)),col="black",lwd=2)
#rl<-loess(residuals(logmod)~predict(logmod))
#y=predict(rl,se=TRUE)
#segments(predict(logmod),y$fit+2*y$se.fit,predict(logmod),y$fit-2*y$se.fit,col="green")
H0:βj=0 (Xj has no effect on P(Y=1)) vs
Ha:βj≠0
Test statistic:
z=bj/se(bj)
this is approximately N(0,1) under H0
Heart Attack Example: #fitting the logistic model
#summary(logmod)
H0:βp−k+1=⋯=βp=0 reduced model
Ha: full model ( all variables in the model)
This is likelihood ratio test that compares the log likelihoods of the two models
Λ=2 ln L(Full model)/L(Reduced model)
Asymptotically, Λ has a Chi-square distribution. Thus the test statistics is aproximate Chi-square.
Test statistic:
χ2=2 log L(Full model) − 2 log L(Reduced model)
where L()= the likelihood function
Under the null hypothesis, the test statistic is approximately Chi-squared with df=k (number of additional variables in the full model)
The Deviance of a model is
Deviance = D = −2 log L(model)
So, the tests statistic for the above situation is
χ2=D(Reduced model)−D(Full model)
H0: reduced model = no variables (only intercept) β1=⋯βp=0 This is the NULL model
Ha: full model (all variables in the model)
Test statistic:
Deviance= χ^2=-2 log L(reduced)/L(full) = D(reduced)-D(full)
Approximately χ2 with df=n−p
Output from R: Null deviance: 27.726 on 19 degrees of freedom Residual deviance: 18.820 on 17 degrees of freedom AIC: 24.82
#anova(logmod)
D(Null) = 27.726 df=19 D(model with Anger & Anxiety) = 18.820, df=17
Test statistic:
χ2=D(Null)−D(model with Anger & Anxiety)=27.726−18.820=8.906
with df=2, so 8.906 is high. P(χ2>8.906)=
1-pchisq(8.906,2)
[1] 0.01164358
This means that the model with Anger and Anxiety in it is better than the model with no variables.
ll.null <- logmod$null.deviance/-2
ll.proposed <- logmod$deviance/-2
ll.null
[1] -2077.198
ll.proposed
[1] -472.9497
(ll.null - ll.proposed) / ll.null
[1] 0.7723137
1 - pchisq(2*(ll.proposed - ll.null), df=(length(logmod$coefficients)-1))
[1] 0
predicted.data <- data.frame(
probability.of.Attrition=logmod$fitted.values,
Attrition=data$Attrition)
predicted.data <- predicted.data[order(predicted.data$probability.of.Attrition, decreasing=FALSE),]
predicted.data$rank <- 1:nrow(predicted.data)
summary(predicted.data)
probability.of.Attrition Attrition rank
Min. :0.0000000 0:1558 Min. : 1.0
1st Qu.:0.0000474 1:1442 1st Qu.: 750.8
Median :0.4753597 Median :1500.5
Mean :0.4806667 Mean :1500.5
3rd Qu.:0.9801070 3rd Qu.:2250.2
Max. :1.0000000 Max. :3000.0
## Lastly, we can plot the predicted probabilities for each sample having
## heart disease and color by whether or not they actually had heart disease
ggplot(data=predicted.data, aes(x=rank, y=probability.of.Attrition)) +
geom_point(aes(color=Attrition), alpha=1, shape=4, stroke=2) +
xlab("Index") +
ylab("Predicted probability of Attrition")
ggsave("attrition_probabilities.pdf")
Saving 7.29 x 4.51 in image
Predicted values (Y vs observations that were classified as positive Y^=1, or negative Y^=0, at a specified threshold.
S<-predict(logmod,type="response")
Y<-data$Attrition
Ps=(S>.45)*1
rbind(c("","Y^=1", "Y^=0"),
c("Y=1", sum((Ps==1)*(Y==1)), sum((Ps!=1)*(Y==1))),
c("Y=0", sum((Ps==1)*(Y==0)), sum((Ps!=1)*(Y==0))))
[,1] [,2] [,3]
[1,] "" "Y^=1" "Y^=0"
[2,] "Y=1" "1403" "39"
[3,] "Y=0" "113" "1445"
There are number of methods of evaluating whether a logistic model is a good model. One such way is sensitivity and specificity.
Sensitivity and specificity are statistical measures of the performance of a binary classification test, also known in statistics as classification function:
Sensitivity (also called the true positive rate, or the recall in some fields) measures the proportion of actual positives which are correctly identified as such (e.g., the percentage of sick people who are correctly identified as having the condition), and is complementary to the false negative rate. Sensitivity= true positives/(true positive + false negative)
Specificity (also called the true negative rate) measures the proportion of negatives which are correctly identified as such (e.g., the percentage of healthy people who are correctly identified as not having the condition), and is complementary to the false positive rate. Specificity=true negatives/(true negative + false positives)
FP=sum((Ps==1)*(Y==0)) #false positives
TP=sum((Ps==1)*(Y==1)) #true positives
FN=sum((Ps!=1)*(Y==1)) #false neagtive
TN=sum((Ps!=1)*(Y==0)) #true negative
TP/(TP+FN)
[1] 0.9729542
TN/(TN+FP)
[1] 0.9274711
sum((Ps==1)*(Y==0))/sum(Y==0) #false positives
[1] 0.07252888
sum((Ps==1)*(Y==1))/sum(Y==1) #true positives
[1] 0.9729542
#### Let's find optimal subsets of features for the best model
#dim(data)
#help(regsubsets)
#regfit.bwd=regsubsets(
# logmod$formula
# ,data, nvmax=15, nbest=3, method="backward", really.big=T)
#reg.summary<-summary(regfit.bwd)
#names(reg.summary)
#max.rsq=which.max(reg.summary$rsq)
#max.adjr2=which.max(reg.summary$adjr2)
#min.rss=which.min(reg.summary$rss)
#min.cp=which.min(reg.summary$cp)
#plot(reg.summary$rsq, xlab="Number of Variables", ylab="RSquare", type="l")
#points(max.rsq,reg.summary$rsq[max.rsq],col="red",cex=2,pch=20)
#plot(regfit.bwd,scale = "r2")
#plot(regfit.bwd,scale = "adjr2")
The Residual Sum of Squares (RSS) is the sum of the squared distances between your actual versus your predicted values:
Summation_1_to_n{ (yi-yi_hat)^2 }
The actual number you get depends largely on the scale of your response variable. Taken alone, the RSS isn’t so informative.
https://stats.stackexchange.com/a/349246
#plot(reg.summary$rss, xlab="Number of Variables", ylab="RSS", type="l")
#points(min.rss,reg.summary$rss[min.rss],col="red",cex=2,pch=20)
Concerning R2, there is an adjusted version, called Adjusted R-squared, which adjusts the R2 for having too many variables in the model.
#plot(reg.summary$adjr2, xlab="Number of Variables", ylab="adjR2", type="l")
#points(max.adjr2,reg.summary$adjr2[max.adjr2],col="red",cex=2,pch=20)
#max.adjr2
Additionally, there are four other important metrics - AIC, AICc, BIC and Mallows Cp - that are commonly used for model evaluation and selection. These are an unbiased estimate of the model prediction error MSE. The lower these metrics, the better the model.
AIC stands for (Akaike’s Information Criteria), a metric developped by the Japanese Statistician, Hirotugu Akaike, 1970. The basic idea of AIC is to penalize the inclusion of additional variables to a model. It adds a penalty that increases the error when including additional terms. The lower the AIC, the better the model.
AIC(logmod)
[1] 1527.899
AICc is a version of AIC corrected for small sample sizes.
BIC (or Bayesian information criteria) is a variant of AIC with a stronger penalty for including additional variables to the model.
BIC(logmod)
[1] 3275.752
#plot(regfit.bwd,scale = "bic")
# coef(regfit.bwd, 10) #default bic
Mallows Cp: A variant of AIC developed by Colin Mallows.
#plot(reg.summary$cp,xlab="Number of Variables",ylab="Cp",type="l")
#points(min.cp,reg.summary$cp[min.cp],col="red",cex=2,pch=20)
#plot(regfit.bwd,scale = "Cp")
#help(regsubsets)
#regfit.full=regsubsets(
# logmod$formula
# ,data=data, nvmax=15, nbest=3, method="exhaustive")
#summary(regfit.full)
#plot(regfit.full,scale="Cp")
#coef(regfit.full,which.min(summary(regfit.full)$cp))
AIC(logmod)
[1] 1527.899
BIC(logmod)
[1] 3275.752